For excerises 1-4, use the Yelp business json file. For exercises 5-6, use the Yelp review json file.

Exercise 1: Create a new column that contains only the zipcode.



In [2]:

    
import pandas as pd
import numpy as np
# read json, lines=True because each line is a single json
business = pd.read_json('Yelp/yelp_academic_dataset_business.json', lines=True)
print(business.shape)
business.head()









    



(85901, 15)






    Out[2]:






  
    
      
      attributes
      business_id
      categories
      city
      full_address
      hours
      latitude
      longitude
      name
      neighborhoods
      open
      review_count
      stars
      state
      type
    
  
  
    
      0
      {'Take-out': True, 'Drive-Thru': False, 'Good ...
      5UmKMjUEUNdYWqANhGckJw
      [Fast Food, Restaurants]
      Dravosburg
      4734 Lebanon Church Rd\nDravosburg, PA 15034
      {'Friday': {'close': '21:00', 'open': '11:00'}...
      40.354327
      -79.900706
      Mr Hoagie
      []
      True
      7
      3.5
      PA
      business
    
    
      1
      {'Happy Hour': True, 'Accepts Credit Cards': T...
      UsFtqoBl7naz8AVUBZMjQQ
      [Nightlife]
      Dravosburg
      202 McClure St\nDravosburg, PA 15034
      {}
      40.350553
      -79.886814
      Clancy's Pub
      []
      True
      5
      3.0
      PA
      business
    
    
      2
      {'Good for Kids': True}
      cE27W9VPgO88Qxe4ol6y_g
      [Active Life, Mini Golf, Golf]
      Bethel Park
      1530 Hamilton Rd\nBethel Park, PA 15234
      {}
      40.354115
      -80.014660
      Cool Springs Golf Center
      []
      False
      5
      2.5
      PA
      business
    
    
      3
      {'Alcohol': 'full_bar', 'Noise Level': 'averag...
      mVHrayjG3uZ_RLHkLj-AMg
      [Bars, American (New), Nightlife, Lounges, Res...
      Braddock
      414 Hawkins Ave\nBraddock, PA 15104
      {'Tuesday': {'close': '19:00', 'open': '10:00'...
      40.408830
      -79.866211
      Emil's Lounge
      []
      True
      26
      4.5
      PA
      business
    
    
      4
      {'Parking': {'garage': False, 'street': False,...
      mYSpR_SLPgUVymYOvTQd_Q
      [Active Life, Golf]
      Braddock
      1000 Clubhouse Dr\nBraddock, PA 15104
      {'Sunday': {'close': '15:00', 'open': '10:00'}...
      40.403405
      -79.855782
      Grand View Golf Club
      []
      True
      3
      5.0
      PA
      business



In [9]:

    
# extract five consecutive digits after a white space at the end of the address
business['zipcode'] = business['full_address'].str.extract('(\s\d{5}$)', expand=True)
business[['zipcode', 'full_address']].tail()









    Out[9]:






  
    
      
      zipcode
      full_address
    
  
  
    
      85896
      89120
      3899 East Sunset Rd\nSte 105\nLas Vegas, NV 89120
    
    
      85897
      85338
      525 N Estrella Pkwy\nSte 100\nGoodyear, AZ 85338
    
    
      85898
      85020
      9215 North 7th St\nPhoenix, AZ 85020
    
    
      85899
      NaN
      16 Waterloo Place\nOld Town\nEdinburgh EH1 3EG
    
    
      85900
      28208
      Charlotte Douglas International Airport Termin...

Exercise 2: The table contains a column called 'categories' and each entry in this column is populated by a list. We are interested in those businesses that restaurants. Create a new column 'Restaurant_type' that contains a description of the restaurant based on the other elements of 'categories.

That is, if we have '[Sushi Bars, Japanese, Restaurants]' in categories the 'Restaurant_type will be '{'SushiBars': 1, 'Japanese': 1, 'Mexican': 0, ...}'



In [10]:

    
# create a set containing all the possible categories
cat = set([])

for row in business['categories']:
    if 'Restaurants' in row:
        cat = cat.union(set(row))

cat = cat.difference(set(['Restaurants']))



In [11]:

    
# function that creates a dictionary based on the previous categories and sets the values to 1 if the category is presenti in the argument
def cat_dict(row):
    if 'Restaurants' in row:
        d = dict.fromkeys(cat, 0)
        for e in row:
            if e != 'Restaurants':
                d[e] = 1
        return d
    else:
        return np.NaN



In [12]:

    
business['Restaurant_type'] = business['categories'].apply(cat_dict)



In [13]:

    
business[['Restaurant_type', 'categories']].head()









    Out[13]:






  
    
      
      Restaurant_type
      categories
    
  
  
    
      0
      {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ...
      [Fast Food, Restaurants]
    
    
      1
      NaN
      [Nightlife]
    
    
      2
      NaN
      [Active Life, Mini Golf, Golf]
    
    
      3
      {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ...
      [Bars, American (New), Nightlife, Lounges, Res...
    
    
      4
      NaN
      [Active Life, Golf]

Exercise 3: Lets clean the 'attributes' column. The entries in this column are dictionaries. We need to do two things:

1) Turn all the True or False in the dictionary to 1s and 0s.

2) There are some entries within dictionaries that are dictionaries themselves, lets turn the whole entry into just one dictionary, for example if we have

'{'Accepts Credit Cards': True, 'Alcohol': 'none','Ambience': {'casual': False,'classy': False}}'

then turn it into

'{'Accepts Credit Cards':1, 'Alcohol_none': 1, 'Ambience_casual': 0, 'Ambience_classy': 0}'.

There might be other entries like {'Price Range': 1} where the values are numerical so we might want to change that into {'Price_Range_1': 1}.

The reason we modify categorical variables like this is that machine learning algorithms cannot interpret textual data like "True" and "False". They need numerical inputs such as 1 and 0.



In [14]:

    
# if the value is a boolean replaces the spaces in the key with underscores and returns 0-1 instead of True-False
# else replaces the spaces in the key with underscores, adds the value to the key and returns 1
def replace_entry(key, value):
    if type(value) is bool:
        if value:
            return key.replace(' ', '_'), 1
        else:
            return key.replace(' ', '_'), 0
    else:
        return key.replace(' ', '_') + '_' + str(value).replace(' ', '_'), 1

# returns the dictionary after applying the function above, navigates through the dictionary inside the values
def clean_dict(dct):
    
    out_dct = {}

    for key in dct:
        value = dct[key]
        if type(value) is dict:
            for subkey in value:
                new_key, new_value = replace_entry(subkey, value[subkey])
                out_dct[key.replace(' ', '_') + '_' + new_key] = new_value
        else:
            new_key, new_value = replace_entry(key, value)
            out_dct[new_key] = new_value

    return out_dct

business['attributes'] = business['attributes'].apply(clean_dict)



In [15]:

    
business[['attributes']].head(10)









    Out[15]:






  
    
      
      attributes
    
  
  
    
      0
      {'Take-out': 1, 'Drive-Thru': 0, 'Good_For_des...
    
    
      1
      {'Happy_Hour': 1, 'Accepts_Credit_Cards': 1, '...
    
    
      2
      {'Good_for_Kids': 1}
    
    
      3
      {'Alcohol_full_bar': 1, 'Noise_Level_average':...
    
    
      4
      {'Parking_garage': 0, 'Parking_street': 0, 'Pa...
    
    
      5
      {'Alcohol_full_bar': 1, 'Noise_Level_loud': 1,...
    
    
      6
      {'Accepts_Credit_Cards': 1}
    
    
      7
      {'Good_for_Kids': 1}
    
    
      8
      {}
    
    
      9
      {}

Exercise 4: Create a new column for every day of the week and fill it with the amount of hours the business is open that day.

Your approach should handle businesses that stay open late like bars and nightclubs.



In [16]:

    
# dct = {'Friday': {'close': '21:00', 'open': '11:00'},
#  'Monday': {'close': '21:00', 'open': '11:00'},
#  'Thursday': {'close': '21:00', 'open': '11:00'},
#  'Tuesday': {'close': '21:00', 'open': '11:00'},
#  'Wednesday': {'close': '21:00', 'open': '11:00'}}

dct = {'Friday': {'close': '02:00', 'open': '11:00'},
 'Monday': {'close': '02:00', 'open': '11:00'},
 'Saturday': {'close': '02:00', 'open': '12:00'},
 'Sunday': {'close': '22:00', 'open': '12:00'},
 'Thursday': {'close': '02:00', 'open': '11:00'},
 'Tuesday': {'close': '02:00', 'open': '11:00'},
 'Wednesday': {'close': '02:00', 'open': '11:00'}}

day_map = {'Friday': '2017-08-04',
 'Monday': '2017-07-31',
 'Saturday': '2017-08-05',
 'Sunday': '2017-08-06',
 'Thursday': '2017-08-03',
 'Tuesday': '2017-08-01',
 'Wednesday': '2017-08-02'}



In [17]:

    
def daily_hours(dct):
    # initialize output
    hours = dict.fromkeys(range(0, 7), 0)
    for day in dct:
        # set start datetime, end datetime and end of day datetime
        start = pd.to_datetime(day_map[day] + ' ' + dct[day]['open'])
        end = pd.to_datetime(day_map[day] + ' ' + dct[day]['close'])
        end_of_day = pd.to_datetime(day_map[day] + ' 00:00:00') + pd.Timedelta(1, unit='D')
        # if closing hours is in the next day add one to the end and add to output the two days
        if dct[day]['close'] < dct[day]['open']:
            end = end + pd.Timedelta(1, unit='D')
            diff = end_of_day - start
            hours[start.dayofweek] += int(np.floor(diff.seconds / 3600))
            diff = end - end_of_day
            hours[end_of_day.dayofweek] += int(np.floor(diff.seconds / 3600))
        # else add only the current day
        else:
            diff = end - start
            hours[start.dayofweek] += int(np.floor(diff.seconds / 3600))
    return hours[0], hours[1], hours[2], hours[3], hours[4], hours[5], hours[6]

business['Monday'], business['Tuesday'], business['Wednesday'], business['Thursday'], business['Friday'], business['Saturday'], business['Sunday'] = zip(*business['hours'].map(daily_hours))

#daily_hours(dct)



In [8]:

    
# solution code:
FMT = '%H:%M'

def Time(row, date):
    try:
        closing = pd.datetime.strptime(row['hours'][date]['close'], FMT)
        opening = pd.datetime.strptime(row['hours'][date]['open'], FMT)
        time = closing - opening
        
        if time < pd.Timedelta(days=0):
            time += pd.Timedelta(days=1)
        
    except:
        time = pd.Timedelta(days=0)
    
    return time

business.apply(lambda x: Time(x, 'Monday'), axis=1).head(10)









    Out[8]:





0   10:00:00
1   00:00:00
2   00:00:00
3   00:00:00
4   00:00:00
5   15:00:00
6   10:30:00
7   00:00:00
8   00:00:00
9   00:00:00
dtype: timedelta64[ns]



In [18]:

    
business[['hours', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']].head(10)









    Out[18]:






  
    
      
      hours
      Monday
      Tuesday
      Wednesday
      Thursday
      Friday
      Saturday
      Sunday
    
  
  
    
      0
      {'Friday': {'close': '21:00', 'open': '11:00'}...
      10
      10
      10
      10
      10
      0
      0
    
    
      1
      {}
      0
      0
      0
      0
      0
      0
      0
    
    
      2
      {}
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      {'Tuesday': {'close': '19:00', 'open': '10:00'...
      0
      9
      9
      9
      10
      6
      0
    
    
      4
      {'Sunday': {'close': '15:00', 'open': '10:00'}...
      0
      0
      9
      9
      9
      9
      5
    
    
      5
      {'Monday': {'close': '02:00', 'open': '11:00'}...
      13
      15
      15
      15
      15
      14
      12
    
    
      6
      {'Monday': {'close': '18:00', 'open': '07:30'}...
      10
      10
      10
      10
      10
      8
      0
    
    
      7
      {}
      0
      0
      0
      0
      0
      0
      0
    
    
      8
      {}
      0
      0
      0
      0
      0
      0
      0
    
    
      9
      {}
      0
      0
      0
      0
      0
      0
      0

Exercise 5: Create a table with the average review for a business.

You will need to pull in a new json file and merge DataFrames for the next 2 exercises.



In [19]:

    
review = pd.read_json('Yelp/yelp_academic_dataset_review.json', lines=True)
print(review.shape)
review.head()









    



(2685066, 8)






    Out[19]:






  
    
      
      business_id
      date
      review_id
      stars
      text
      type
      user_id
      votes
    
  
  
    
      0
      5UmKMjUEUNdYWqANhGckJw
      2012-08-01
      Ya85v4eqdd6k9Od8HbQjyA
      4
      Mr Hoagie is an institution. Walking in, it do...
      review
      PUFPaY9KxDAcGqfsorJp3Q
      {'funny': 0, 'useful': 0, 'cool': 0}
    
    
      1
      5UmKMjUEUNdYWqANhGckJw
      2014-02-13
      KPvLNJ21_4wbYNctrOwWdQ
      5
      Excellent food. Superb customer service. I mis...
      review
      Iu6AxdBYGR4A0wspR9BYHA
      {'funny': 0, 'useful': 0, 'cool': 0}
    
    
      2
      5UmKMjUEUNdYWqANhGckJw
      2015-10-31
      fFSoGV46Yxuwbr3fHNuZig
      5
      Yes this place is a little out dated and not o...
      review
      auESFwWvW42h6alXgFxAXQ
      {'funny': 1, 'useful': 1, 'cool': 0}
    
    
      3
      5UmKMjUEUNdYWqANhGckJw
      2015-12-26
      pVMIt0a_QsKtuDfWVfSk2A
      3
      PROS: Italian hoagie was delicious.  Friendly ...
      review
      qiczib2fO_1VBG8IoCGvVg
      {'funny': 0, 'useful': 0, 'cool': 0}
    
    
      4
      5UmKMjUEUNdYWqANhGckJw
      2016-04-08
      AEyiQ_Y44isJmNbMTyoMKQ
      2
      First the only reason this place could possibl...
      review
      qEE5EvV-f-s7yHC0Z4ydJQ
      {'funny': 0, 'useful': 1, 'cool': 0}



In [20]:

    
average_review = review[['business_id', 'stars']].groupby('business_id').mean()



In [21]:

    
average_review.head()









    Out[21]:






  
    
      
      stars
    
    
      business_id
      
    
  
  
    
      --0ZoBTQWQra1FxD4rBWmg
      2.000000
    
    
      --1emggGHgoG6ipd_RMb-g
      3.333333
    
    
      --4Pe8BZ6gj57VFL5mUE8g
      2.750000
    
    
      --5jkZ3-nUPZxUvtcbr8Uw
      4.580357
    
    
      --7PRjnsjMA6uhPK8mW13Q
      2.666667

Exercise 6: Create a new table that only contains restaurants with the following schema:

Business_Name | Restaurant_type | Friday hours | Saturday hours | Attributes | Zipcode | Average Rating



In [27]:

    
df = pd.merge(business[['business_id', 'name' , 'Restaurant_type', 'Friday', 'Saturday', 'attributes', 'zipcode']], average_review, left_on='business_id', right_index=True, how='inner')
df.drop('business_id', axis=1, inplace=True)
names = ['Business_Name', 'Restaurant_type', 'Friday hours', 'Saturday hours', 'Attributes', 'Zipcode', 'Average Rating']
df.columns = names
#remove restaurants
df = df[df['Restaurant_type'] == df['Restaurant_type']]
df.head()









    Out[27]:






  
    
      
      Business_Name
      Restaurant_type
      Friday hours
      Saturday hours
      Attributes
      Zipcode
      Average Rating
    
  
  
    
      0
      Mr Hoagie
      {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ...
      10
      0
      {'Take-out': 1, 'Drive-Thru': 0, 'Good_For_des...
      15034
      3.428571
    
    
      3
      Emil's Lounge
      {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ...
      10
      6
      {'Alcohol_full_bar': 1, 'Noise_Level_average':...
      15104
      4.680000
    
    
      5
      Alexion's Bar & Grill
      {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ...
      15
      14
      {'Alcohol_full_bar': 1, 'Noise_Level_loud': 1,...
      15106
      3.894737
    
    
      12
      Kings Family Restaurant
      {'Taiwanese': 0, 'Burgers': 1, 'Scottish': 0, ...
      18
      18
      {'Take-out': 1, 'Drive-Thru': 0, 'Good_For_des...
      15106
      3.250000
    
    
      17
      Rocky's Lounge
      {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ...
      12
      12
      {'Alcohol_full_bar': 1, 'Noise_Level_average':...
      15106
      3.800000

	attributes	business_id	categories	city	full_address	hours	latitude	longitude	name	neighborhoods	open	review_count	stars	state	type
0	{'Take-out': True, 'Drive-Thru': False, 'Good ...	5UmKMjUEUNdYWqANhGckJw	[Fast Food, Restaurants]	Dravosburg	4734 Lebanon Church Rd\nDravosburg, PA 15034	{'Friday': {'close': '21:00', 'open': '11:00'}...	40.354327	-79.900706	Mr Hoagie	[]	True	7	3.5	PA	business
1	{'Happy Hour': True, 'Accepts Credit Cards': T...	UsFtqoBl7naz8AVUBZMjQQ	[Nightlife]	Dravosburg	202 McClure St\nDravosburg, PA 15034	{}	40.350553	-79.886814	Clancy's Pub	[]	True	5	3.0	PA	business
2	{'Good for Kids': True}	cE27W9VPgO88Qxe4ol6y_g	[Active Life, Mini Golf, Golf]	Bethel Park	1530 Hamilton Rd\nBethel Park, PA 15234	{}	40.354115	-80.014660	Cool Springs Golf Center	[]	False	5	2.5	PA	business
3	{'Alcohol': 'full_bar', 'Noise Level': 'averag...	mVHrayjG3uZ_RLHkLj-AMg	[Bars, American (New), Nightlife, Lounges, Res...	Braddock	414 Hawkins Ave\nBraddock, PA 15104	{'Tuesday': {'close': '19:00', 'open': '10:00'...	40.408830	-79.866211	Emil's Lounge	[]	True	26	4.5	PA	business
4	{'Parking': {'garage': False, 'street': False,...	mYSpR_SLPgUVymYOvTQd_Q	[Active Life, Golf]	Braddock	1000 Clubhouse Dr\nBraddock, PA 15104	{'Sunday': {'close': '15:00', 'open': '10:00'}...	40.403405	-79.855782	Grand View Golf Club	[]	True	3	5.0	PA	business

	zipcode	full_address
85896	89120	3899 East Sunset Rd\nSte 105\nLas Vegas, NV 89120
85897	85338	525 N Estrella Pkwy\nSte 100\nGoodyear, AZ 85338
85898	85020	9215 North 7th St\nPhoenix, AZ 85020
85899	NaN	16 Waterloo Place\nOld Town\nEdinburgh EH1 3EG
85900	28208	Charlotte Douglas International Airport Termin...

	Restaurant_type	categories
0	{'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ...	[Fast Food, Restaurants]
1	NaN	[Nightlife]
2	NaN	[Active Life, Mini Golf, Golf]
3	{'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ...	[Bars, American (New), Nightlife, Lounges, Res...
4	NaN	[Active Life, Golf]

	attributes
0	{'Take-out': 1, 'Drive-Thru': 0, 'Good_For_des...
1	{'Happy_Hour': 1, 'Accepts_Credit_Cards': 1, '...
2	{'Good_for_Kids': 1}
3	{'Alcohol_full_bar': 1, 'Noise_Level_average':...
4	{'Parking_garage': 0, 'Parking_street': 0, 'Pa...
5	{'Alcohol_full_bar': 1, 'Noise_Level_loud': 1,...
6	{'Accepts_Credit_Cards': 1}
7	{'Good_for_Kids': 1}
8	{}
9	{}

	business_id	date	review_id	stars	text	type	user_id	votes
0	5UmKMjUEUNdYWqANhGckJw	2012-08-01	Ya85v4eqdd6k9Od8HbQjyA	4	Mr Hoagie is an institution. Walking in, it do...	review	PUFPaY9KxDAcGqfsorJp3Q	{'funny': 0, 'useful': 0, 'cool': 0}
1	5UmKMjUEUNdYWqANhGckJw	2014-02-13	KPvLNJ21_4wbYNctrOwWdQ	5	Excellent food. Superb customer service. I mis...	review	Iu6AxdBYGR4A0wspR9BYHA	{'funny': 0, 'useful': 0, 'cool': 0}
2	5UmKMjUEUNdYWqANhGckJw	2015-10-31	fFSoGV46Yxuwbr3fHNuZig	5	Yes this place is a little out dated and not o...	review	auESFwWvW42h6alXgFxAXQ	{'funny': 1, 'useful': 1, 'cool': 0}
3	5UmKMjUEUNdYWqANhGckJw	2015-12-26	pVMIt0a_QsKtuDfWVfSk2A	3	PROS: Italian hoagie was delicious. Friendly ...	review	qiczib2fO_1VBG8IoCGvVg	{'funny': 0, 'useful': 0, 'cool': 0}
4	5UmKMjUEUNdYWqANhGckJw	2016-04-08	AEyiQ_Y44isJmNbMTyoMKQ	2	First the only reason this place could possibl...	review	qEE5EvV-f-s7yHC0Z4ydJQ	{'funny': 0, 'useful': 1, 'cool': 0}

	stars
business_id
--0ZoBTQWQra1FxD4rBWmg	2.000000
--1emggGHgoG6ipd_RMb-g	3.333333
--4Pe8BZ6gj57VFL5mUE8g	2.750000
--5jkZ3-nUPZxUvtcbr8Uw	4.580357
--7PRjnsjMA6uhPK8mW13Q	2.666667